Metagenome Analysis Report

This report is not a diagnostic / clinical report and is intended for Research Use Only!


Eurofins Project ID: EF-DEMO
Date of Processing: 24 March, 2025
Pipeline: Metagenome Analysis Pipeline
Version: v2.4.6


Analysis Workflow

Schematic diagram showing the main steps of the analysis method followed to perform the data analysis.

Sequence Quality Control

Quality Control of raw sequencing data

Raw sequencing data are preprocessed to generate clean data for downstream analysis.

The quality of the raw sequencing data is checked, and the data are filtered to retain only high-quality bases by adapter trimming, quality filtering and per-read quality pruning. Quality is expressed as the probability of an incorrect base call or, equivalently, the base call accuracy. The quality score (Q) is logarithmic, Q = -10 × log10(P), where P is the probability of an incorrect base call. A quality score of 10 therefore corresponds to a base call accuracy of 90%, a score of 20 to 99% and a score of 30 to 99.9%. These probabilities are produced by the base-calling algorithm and depend on how much signal was captured for the base incorporation.
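The relationship between quality score and base call accuracy can be checked with a few lines of Python. This is a minimal sketch of the standard Phred definition only; the function names are illustrative and are not part of the pipeline.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Probability that a base call with Phred score q is incorrect."""
    return 10 ** (-q / 10)


def phred_to_accuracy(q: float) -> float:
    """Base call accuracy (1 - error probability) for Phred score q."""
    return 1.0 - phred_to_error_probability(q)


def error_probability_to_phred(p: float) -> float:
    """Phred score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


# Accuracy for the three scores discussed in the text.
for q in (10, 20, 30):
    print(f"Q{q}: accuracy = {phred_to_accuracy(q):.1%}")
```

Each additional 10 points on the Phred scale reduces the error probability tenfold.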

Sequencing data in which more than 90% of the reads have a quality score of at least Q30 are of very good quality. For reasonably good sample source material, according to Illumina specifications, one could expect >75% of reads to have a Phred quality of at least Q30.

Raw sequencing data are processed with fastp[2] to remove poor-quality bases (below Phred quality 20) using a sliding window approach: if the average quality of the bases in the window drops below Q20, those bases are removed from the read. After quality trimming, the program checks the reads for adapter sequences and removes any that are found. Reads shorter than 30 bp are then discarded, so that only high-quality sequencing reads are retained for each sample. For paired-end data, only pairs in which both reads pass the QC criteria are considered for downstream analysis.
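The sliding window trimming and length filter described above can be sketched as follows. This is an illustration of the general technique, not fastp's implementation: the window size of 4, the 5' to 3' scan direction and the function names are assumptions, while the Q20 and 30 bp thresholds are taken from the text.

```python
def sliding_window_trim(seq, quals, window=4, min_mean_q=20):
    """Scan the read from 5' to 3' and cut it at the start of the first
    window whose mean quality is below min_mean_q.

    seq   : read sequence (str)
    quals : per-base Phred scores (list of int), same length as seq
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    last_start = max(len(seq) - window, 0)
    for start in range(last_start + 1):
        win = quals[start:start + window]
        if win and sum(win) / len(win) < min_mean_q:
            # Drop the failing window and everything to its right.
            return seq[:start], quals[:start]
    return seq, quals


def passes_length_filter(seq, min_length=30):
    """Keep only reads of at least min_length bases after trimming."""
    return len(seq) >= min_length


# A 40 bp read whose last four bases are of poor quality.
read = "A" * 40
scores = [35] * 36 + [10] * 4
trimmed_seq, trimmed_quals = sliding_window_trim(read, scores)
print(len(trimmed_seq), passes_length_filter(trimmed_seq))
```

For paired-end data, the same two functions would be applied to both mates, and the pair kept only if both mates pass the length filter.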

After QC processing, QC metrics such as Q30 reads and GC content can be used to assess the sequencing and sample quality across the samples.

Read Statistics

  • Table 1: Sequence Quality Metrics overview. For each sample, the following QC metrics are provided:
    • Sample Name: name of the sample.
    • Total Raw Reads: the total number of raw sequencing reads generated for the sample.
    • Total HQ Reads: the total number of high-quality reads after sequence cleaning and filtering.
    • HQ Bases (Q30): percentage of high-quality bases with a Phred quality of at least 30.
    • GC Content: GC content, as a percentage, of the high-quality sequencing reads.
    • Mean Read Length (bp): Average read length in bp of high quality sequencing reads.
    • HQ Reads %: High Quality Reads percentage.
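As an illustration of how the metrics in Table 1 relate to each other, the sketch below computes them from a list of filtered reads. The input layout (tuples of sequence and per-base Phred scores), the function name and the choice of denominators are assumptions made for this example; the pipeline itself obtains these values from its QC step.

```python
def qc_metrics(total_raw_reads, hq_reads):
    """Summarise QC metrics for one sample.

    total_raw_reads : number of reads before filtering
    hq_reads        : list of (sequence, phred_scores) tuples after filtering
    """
    total_hq_reads = len(hq_reads)
    total_bases = sum(len(seq) for seq, _ in hq_reads)
    q30_bases = sum(1 for _, quals in hq_reads for q in quals if q >= 30)
    gc_bases = sum(seq.upper().count("G") + seq.upper().count("C")
                   for seq, _ in hq_reads)

    def pct(part, whole):
        # Avoid division by zero for empty samples.
        return 100.0 * part / whole if whole else 0.0

    return {
        "Total Raw Reads": total_raw_reads,
        "Total HQ Reads": total_hq_reads,
        "HQ Bases (Q30)": pct(q30_bases, total_bases),
        "GC Content": pct(gc_bases, total_bases),
        "Mean Read Length (bp)": (total_bases / total_hq_reads
                                  if total_hq_reads else 0.0),
        "HQ Reads %": pct(total_hq_reads, total_raw_reads),
    }


# Toy sample: 4 raw reads, 2 of which survived filtering.
example = qc_metrics(4, [("GGCC", [30, 30, 30, 20]),
                         ("ATAT", [40, 40, 10, 10])])
for name, value in example.items():
    print(f"{name}: {value}")
```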

Host Removal

Host removal is done using Kraken[9]. Kraken classifies reads by breaking each read into overlapping k-mers. Each k-mer is mapped to the lowest common ancestor (LCA) of the genomes containing that k-mer in a precomputed reference database. For each read, a classification tree is built by pruning the taxonomy and retaining only the taxa (including ancestors) associated with k-mers in that read. Each node is weighted by the number of k-mers mapped to it, and the root-to-leaf path with the highest sum of weights is used to classify the read. KrakenUniq[1] additionally computes the number of unique k-mers observed for each taxon, which allows more false positives to be filtered out. The FASTQ files were filtered for non-host sequences using SeqKit[7] for downstream analysis. The final counts of host-classified, unclassified and filter-passed reads are reported in the tables below.

  • Table 2: Host removal profile metrics per sample:
  • Table 3: Number of reads assigned to different host species per sample:
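The k-mer based classification described in this section can be sketched on a toy example. Everything in the block below is hypothetical: the four-taxon taxonomy, the k-mer database, k = 3 and the function names are chosen for illustration, and the code is not Kraken's implementation. Ties between equally scored paths are resolved here by taking the LCA of the tied taxa.

```python
from collections import Counter

# Hypothetical toy taxonomy: child -> parent ("root" has no parent).
PARENT = {
    "Bacteria": "root",
    "Escherichia": "Bacteria",
    "E. coli": "Escherichia",
    "Salmonella": "Bacteria",
}

# Hypothetical precomputed database: k-mer -> LCA of genomes containing it.
KMER_TO_TAXON = {
    "AAA": "Bacteria",
    "AAC": "Escherichia",
    "ACC": "E. coli",
    "CCC": "E. coli",
    "CCG": "Salmonella",
}


def lineage(taxon):
    """Path from a taxon up to the root, inclusive."""
    path = [taxon]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def lca(a, b):
    """Lowest common ancestor of two taxa in the toy taxonomy."""
    ancestors = set(lineage(a))
    for taxon in lineage(b):
        if taxon in ancestors:
            return taxon
    return "root"


def classify_read(read, k, kmer_to_taxon):
    """Return the taxon at the end of the highest-weighted root-to-leaf
    path, or None if no k-mer of the read is in the database."""
    hits = Counter()
    for i in range(len(read) - k + 1):
        taxon = kmer_to_taxon.get(read[i:i + k])
        if taxon is not None:
            hits[taxon] += 1
    if not hits:
        return None
    # Path score = sum of k-mer hits on every node from the taxon to the root.
    scores = {t: sum(hits[a] for a in lineage(t)) for t in hits}
    best = max(scores.values())
    winners = [t for t, s in scores.items() if s == best]
    result = winners[0]
    for t in winners[1:]:
        result = lca(result, t)
    return result


print(classify_read("AAACCCGGG", 3, KMER_TO_TAXON))
```

In a host-removal setting, reads whose classification falls within a host taxon would be set aside and the remaining reads passed on to downstream analysis.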

Taxonomic Profiling

Taxonomic profiling is done using MetaPhlAn[8]. MetaPhlAn (Metagenomic Phylogenetic Analysis) is a computational tool for profiling the composition of microbial communities (Bacteria, Archaea, Eukaryotes and Viruses) from metagenomic shotgun sequencing data with species-level resolution. MetaPhlAn relies on unique clade-specific marker genes identified from ~17,000 reference genomes (~13,500 bacterial and archaeal, ~3,500 viral, and ~110 eukaryotic). Unclassified reads are then subjected to KrakenUniq[1]. Kraken[9] classifies reads by breaking each read into overlapping k-mers. Each k-mer is mapped to the lowest common ancestor (LCA) of the genomes containing that k-mer in a precomputed reference database. For each read, a classification tree is built by pruning the taxonomy and retaining only the taxa (including ancestors) associated with k-mers in that read. Each node is weighted by the number of k-mers mapped to it, and the root-to-leaf path with the highest sum of weights is used to classify the read. KrakenUniq additionally computes the number of unique k-mers observed for each taxon, which allows more false positives to be filtered out. The final counts of classified, unclassified and filter-passed reads are reported in the tables below.

  • Table 4: Taxonomic profiling metrics per sample:
  • Table 5: Number of reads assigned to different kingdoms per sample:
    • Ambiguous: Reads which cannot be assigned to one specific kingdom.
    • Eukaryota: Parasitic and non-parasitic Protozoa.
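The idea of using unique k-mer counts to filter false positives can be illustrated as follows. This sketch counts exact unique k-mers with Python sets over all k-mers of the reads assigned to a taxon, which is a simplification of what KrakenUniq does. The input layout, k = 4 and the threshold of 5 unique k-mers are assumptions for the example and do not reflect the filter settings of this pipeline.

```python
from collections import defaultdict


def unique_kmers_per_taxon(classified_reads, k):
    """Count reads and distinct k-mers per taxon.

    classified_reads : iterable of (taxon, read_sequence) pairs
    Returns {taxon: (read_count, unique_kmer_count)}.
    """
    seen = defaultdict(set)
    read_counts = defaultdict(int)
    for taxon, seq in classified_reads:
        read_counts[taxon] += 1
        for i in range(len(seq) - k + 1):
            seen[taxon].add(seq[i:i + k])
    return {t: (read_counts[t], len(seen[t])) for t in read_counts}


def filter_taxa(stats, min_unique_kmers):
    """Keep taxa supported by at least min_unique_kmers distinct k-mers."""
    return {t: v for t, v in stats.items() if v[1] >= min_unique_kmers}


# Taxon A: three identical reads (many reads, little distinct evidence).
# Taxon B: three different reads covering different parts of a genome.
reads = [
    ("A", "ACGTAC"), ("A", "ACGTAC"), ("A", "ACGTAC"),
    ("B", "ACGTAC"), ("B", "TTGGCC"), ("B", "GATTAC"),
]
stats = unique_kmers_per_taxon(reads, k=4)
print(stats)
print(filter_taxa(stats, min_unique_kmers=5))
```

Both taxa have the same read count, but only taxon B is supported by many distinct k-mers, so only B passes the filter.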

Taxa Abundance

Read counts of the input samples observed at various taxonomic levels (Phylum, Genus and Species) are collected and normalized using the rarefy function implemented in the vegan R package[3], so that species richness can be compared across all samples in the analysis run. Rarefied read counts enable better comparisons of OTU profiles between samples with different sample sizes. Abundance, measured as the percentage of OTU-assigned reads at each taxonomic level, is determined and used to generate heatmaps and bar plots at Phylum, Genus and Species levels. Heatmaps and bar plots representing the taxonomic abundance at species level are provided below.
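The effect of rarefying read counts to a common depth and converting them to percentages can be sketched as below. This is an illustration of random subsampling without replacement, one common way to obtain rarefied counts. It is not a reimplementation of the vegan functions used in the pipeline, and the sample data, function names and fixed seed are assumptions.

```python
import random
from collections import Counter


def rarefy_counts(counts, depth, seed=0):
    """Randomly draw `depth` reads without replacement from a sample.

    counts : {taxon: read_count}
    Returns a new {taxon: read_count} whose counts sum to `depth`.
    """
    total = sum(counts.values())
    if depth > total:
        raise ValueError("depth exceeds the number of reads in the sample")
    # Expand counts into one entry per read, then subsample.
    pool = [taxon for taxon, n in counts.items() for _ in range(n)]
    rng = random.Random(seed)
    return dict(Counter(rng.sample(pool, depth)))


def percent_abundance(counts):
    """Convert read counts to percentages of all assigned reads."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {taxon: 100.0 * n / total for taxon, n in counts.items()}


# Two hypothetical samples sequenced to different depths.
samples = {
    "sample_1": {"sp_a": 500, "sp_b": 300, "sp_c": 200},
    "sample_2": {"sp_a": 50, "sp_b": 40, "sp_c": 10},
}
depth = min(sum(c.values()) for c in samples.values())
for name, counts in samples.items():
    rarefied = rarefy_counts(counts, depth)
    print(name, percent_abundance(rarefied))
```

Rarefying every sample to the depth of the smallest one removes the influence of sequencing depth before abundances are compared.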

Figure 1: Heat map(s) showing the taxonomic abundance and its relation across the samples. Dendrograms obtained by hierarchical clustering of the abundance levels show the relationship between the species (left) and the samples (top). The abundance levels (number of reads associated with each taxon) are log2-transformed for clarity. Taxa-level: Species

Figure 2: Bar plot(s) showing the taxonomic abundance across the samples. Taxa-level: Species

Species Diversity

A diversity index is a quantitative measure that reflects how many different types (such as species) are in a dataset, and simultaneously takes into account how evenly the basic entities (such as individuals) are distributed among those types. The value of a diversity index increases both when the number of species increases and when all species are present at nearly the same level. For a given number of species, the value of a diversity index is maximized when all species are equally abundant. The following diversity indices are computed using the vegan[3] package in R:

  • Simpson: Simpson diversity index, with values ranging from 0 to 1. vegan reports this index as 1 - D, where D is the sum of squared species proportions, so values near 1 indicate diverse environments and values near 0 indicate simple environments.
  • InvSimpson: inverse Simpson diversity, with values >0. A larger value means greater diversity.
  • Shannon: Shannon diversity index, with values >0. A higher value means greater diversity.
  • Alpha: Fisher's model of predicting species richness by computing alpha diversity, with values >0. A larger value means greater diversity.
  • Evenness: the distribution of individuals across species, determined by Pielou's measure of species evenness. The index tends to 0 as the evenness decreases in simple environments (species-poor communities).
  • SpeciesNo: the absolute number of species found in each sample.
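The indices listed above can be computed from a vector of species counts as sketched below. The formulas follow the standard definitions; the Simpson value uses the 1 - D convention of vegan's diversity function, under which larger values mean greater diversity. The function names and the bisection solver for Fisher's alpha are choices made for this sketch and are not the pipeline's code.

```python
import math


def fisher_alpha(species_no, individuals):
    """Solve S = alpha * ln(1 + N / alpha) for alpha by bisection."""
    if species_no < 1 or species_no >= individuals:
        return float("nan")  # no finite solution

    def f(a):
        return a * math.log1p(individuals / a) - species_no

    lo, hi = 1e-12, 1.0
    while f(hi) < 0:  # f increases with alpha, so expand until bracketed
        hi *= 2
    for _ in range(200):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def diversity_indices(counts):
    """Diversity indices for one sample given per-species read counts."""
    counts = [c for c in counts if c > 0]
    n = sum(counts)
    s = len(counts)
    p = [c / n for c in counts]
    d = sum(x * x for x in p)  # Simpson's D (dominance)
    shannon = -sum(x * math.log(x) for x in p)
    return {
        "Simpson": 1 - d,
        "InvSimpson": 1 / d,
        "Shannon": shannon,
        "Alpha": fisher_alpha(s, n),
        # Pielou's evenness: Shannon divided by its maximum, ln(S).
        "Evenness": shannon / math.log(s) if s > 1 else float("nan"),
        "SpeciesNo": s,
    }


print(diversity_indices([10, 10, 10, 10]))  # perfectly even community
print(diversity_indices([97, 1, 1, 1]))     # dominated by one species
```

Both example communities contain four species, but the even community scores higher on Simpson, InvSimpson, Shannon and Evenness.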

Figure 3: Various diversity indices computed based on the species counts found in each sample

Rarefaction Curves

Rarefaction allows the calculation of species richness for a given number of individual samples, based on the construction of rarefaction curves. This curve is a plot of the total number of distinct species found as a function of the number of sequences sampled. Sampling curves generally rise very quickly at first and then level off towards an asymptote as fewer new species are found in each sample. These rarefaction curves are calculated from the table of species abundance. The curves represent the average number of different species found for subsamples of the complete dataset.
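A rarefaction curve of this kind can be computed analytically from a species abundance table. The sketch below uses the classical hypergeometric formula for the expected number of species in a random subsample of n sequences, E[S_n] = Σ_i (1 - C(N - N_i, n) / C(N, n)), where N_i is the count of species i and N the total. The function names and toy counts are assumptions; exact binomial coefficients are practical only for small examples like this one.

```python
from math import comb


def expected_richness(counts, n):
    """Expected number of species in a random subsample of n sequences.

    counts : per-species sequence counts for one sample
    n      : subsample size, 0 <= n <= sum(counts)
    """
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the total count")
    denom = comb(total, n)
    # For each species: 1 - P(species is absent from the subsample).
    return sum(1 - comb(total - c, n) / denom for c in counts if c > 0)


def rarefaction_curve(counts, step=1):
    """(subsample size, expected richness) pairs for increasing depth."""
    total = sum(counts)
    return [(n, expected_richness(counts, n)) for n in range(0, total + 1, step)]


# Hypothetical sample with one dominant and several rare species.
sample = [50, 20, 10, 5, 2, 1, 1, 1]
for depth, richness in rarefaction_curve(sample, step=15):
    print(f"{depth:3d} sequences -> {richness:.2f} expected species")
```

The expected richness rises quickly at small depths and flattens as it approaches the total number of species in the sample.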

Figure 4: Rarefaction curve of annotated species richness

Resistome Profiling

The collective antimicrobial resistance (AMR) gene content of a metagenome is referred to as the resistome. Profiling the resistome facilitates a better understanding of AMR gene diversity and dynamics in metagenomic environments.

Antimicrobial resistance genes (ARGs) in the metagenomic samples are screened using the Graphing Resistance Out Of meTagenomes (GROOT[6]) software, which combines a variation graph representation of gene sets with a locality-sensitive hashing forest indexing scheme to allow fast classification of metagenomic sequence reads using similarity-search queries. Subsequent hierarchical local alignment of classified reads against graph traversals enables accurate reconstruction of full-length gene sequences using a scoring scheme.

The reference ARG database contains >6,000 well-curated ARGs sourced from public repositories.
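The similarity-search idea behind locality-sensitive hashing can be illustrated with MinHash signatures, which estimate the Jaccard similarity between the k-mer sets of a read and a reference sequence. This is a sketch of the general technique only: it contains no variation graph and no LSH forest index, and it is not GROOT's implementation. The sequences, k = 7, the signature size of 64 and the function names are assumptions.

```python
import hashlib


def kmer_set(seq, k):
    """All distinct k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def minhash_signature(seq, k=7, num_hashes=64):
    """MinHash signature: for each seeded hash function, keep the
    smallest hash value over all k-mers of the sequence."""
    kmers = kmer_set(seq, k)
    if not kmers:
        raise ValueError("sequence is shorter than k")
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.sha256(f"{seed}:{kmer}".encode()).digest()[:8], "big")
            for kmer in kmers))
    return signature


def estimate_similarity(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of the
    Jaccard similarity of the underlying k-mer sets."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


# Hypothetical reference gene fragment, a read with one mismatch,
# and an unrelated read.
reference = "ATGGCGTACGTTAGCCTAGGATCCGATTACAGG"
similar_read = "ATGGCGTACGTTAGCCTAGGATCCGATTACAGC"
unrelated_read = "TTTTTTTTTTTTTTTTTTTTCCCCCCCCCCCCCC"

ref_sig = minhash_signature(reference)
print(estimate_similarity(ref_sig, minhash_signature(similar_read)))
print(estimate_similarity(ref_sig, minhash_signature(unrelated_read)))
```

Because signatures are short and of fixed length, they can be indexed so that candidate matches for a read are found without comparing it against every reference sequence.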

Resistome Profile Results

Antimicrobial resistance genes and the associated antibiotic classes detected in each metagenomic sample are summarized in the following table.

  • Table 6: ARG profiling metrics per sample:

Supplementary Information: Detected Antibiotics Appendix

  • Table 7: Resistome profiling appendix:

Relevant Programs

  • Table 8: The programs/software used in this pipeline.

Tool           Version  Description
fastp[2]       0.20.0   Fastp is a tool designed to provide fast all-in-one preprocessing for FastQ files.
GROOT[6]       1.1.2    Indexed variation graphs for efficient and accurate resistome profiling
KrakenUniq[1]  0.5.8    KrakenUniq: confident and fast metagenomics classification using unique k-mer counts
KronaTools[4]  2.7.1    Interactive metagenomic visualization in a Web browser
MetaPhlAn[8]   3.0.7    MetaPhlAn for enhanced metagenomic taxonomic profiling
R[5]           4.1.3    R is a programming language and environment for statistical computing.
SeqKit[7]      0.12.0   SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation
vegan[3]       2.6.4    The functions in the vegan package contain tools for diversity analysis, ordination methods and tools for the analysis of dissimilarities

Sequence Data Used

  • Table 9: The samples used in this pipeline.

Filter Settings

  • Table 10: Filters used in postprocessing of taxonomic profiling results.

Deliverables

  • Table 11: List of delivered files, format and recommended programs to access the data.

Note: All the deliverables have been compressed and are available as a tar.gz file named EF-DEMO.Metagenome_Analysis_Results.tar.gz.

References

[1] Florian P Breitwieser, DN Baker, and Steven L Salzberg. 2018. KrakenUniq: Confident and fast metagenomics classification using unique k-mer counts. Genome biology 19, 1 (2018), 1–10.

[2] Shifu Chen, Yanqing Zhou, Yaru Chen, and Jia Gu. 2018. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, 17 (September 2018), i884–i890. https://doi.org/10.1093/bioinformatics/bty560

[3] Jari Oksanen, F Guillaume Blanchet, Roeland Kindt, Pierre Legendre, Peter R Minchin, RB O'Hara, Gavin L Simpson, Peter Solymos, M Henry H Stevens, Helene Wagner, and others. 2013. Package vegan. Community ecology package, version 2, 9 (2013), 1–295.

[4] Brian D Ondov, Nicholas H Bergman, and Adam M Phillippy. 2011. Interactive metagenomic visualization in a web browser. BMC bioinformatics 12, 1 (2011), 1–10.

[5] R Core Team. 2013. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. Retrieved from http://www.R-project.org/

[6] Will PM Rowe and Martyn D Winn. 2018. Indexed variation graphs for efficient and accurate resistome profiling. Bioinformatics 34, 21 (2018), 3601–3608.

[7] Wei Shen, Shuai Le, Yan Li, and Fuquan Hu. 2016. SeqKit: A cross-platform and ultrafast toolkit for fasta/q file manipulation. PloS one 11, 10 (2016), e0163962.

[8] Duy Tin Truong, Eric A Franzosa, Timothy L Tickle, Matthias Scholz, George Weingart, Edoardo Pasolli, Adrian Tett, Curtis Huttenhower, and Nicola Segata. 2015. MetaPhlAn for enhanced metagenomic taxonomic profiling. Nature methods 12, 10 (2015), 902–903.

[9] Derrick E Wood and Steven L Salzberg. 2014. Kraken: Ultrafast metagenomic sequence classification using exact alignments. Genome biology 15, 3 (2014), 1–12.



Eurofins Genomics Europe Sequencing GmbH • Jakob-Stadler-Platz 7 • 78467 Constance • GERMANY